The goal of this project is to identify and quantify how chemical properties affect the quality grade of red wine. The dataset has 1599 rows (red wine samples) and 12 variables: 11 physicochemical properties and a quality score. The samples are red variants of the Portuguese “Vinho Verde” wine. A multiple regression analysis is conducted to determine whether, and how, the 11 independent variables can be used in a model to explain the variation in the quality of a red wine.
First, some preliminary explorations are performed:
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
In the above boxplots, the red star represents the mean and the middle blue line represents the median. Comparing the mean and the median, together with the histogram on the right, we can see that when the data are approximately normally distributed the mean and the median nearly coincide (e.g. pH or density), whereas when the data are skewed they are farther apart (e.g. sulphates or total.sulfur.dioxide). Boxplots also help identify outliers, which appear as individual points beyond the whiskers at either end. Looking through the boxplots and histograms for each variable, four variables (fixed acidity, residual sugar, total sulfur dioxide, and free sulfur dioxide) appear to have the largest outliers. Therefore, I decided to remove the observations that fall in the top 1% of each of these variables.
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.600 Min. :0.1200 Min. :0.0000 Min. :0.900
## 1st Qu.: 7.100 1st Qu.:0.3950 1st Qu.:0.0900 1st Qu.:1.900
## Median : 7.900 Median :0.5200 Median :0.2500 Median :2.200
## Mean : 8.259 Mean :0.5288 Mean :0.2661 Mean :2.409
## 3rd Qu.: 9.100 3rd Qu.:0.6400 3rd Qu.:0.4200 3rd Qu.:2.600
## Max. :13.200 Max. :1.5800 Max. :1.0000 Max. :8.300
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 21.25
## Median :0.07900 Median :13.00 Median : 37.00
## Mean :0.08699 Mean :15.17 Mean : 44.52
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 60.00
## Max. :0.61100 Max. :46.00 Max. :144.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9967 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.316 Mean :0.6569 Mean :10.43
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7275 3rd Qu.:11.10
## Max. :1.0029 Max. :4.010 Max. :2.0000 Max. :14.00
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
## 'data.frame': 1534 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
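The top-1% trimming described above can be sketched in base R as follows. This is a minimal illustration, not the exact code behind the output shown: the data frame `wine` is synthetic, `trim_top_1pct` is a hypothetical helper, and computing every cutoff on the untrimmed data (rather than trimming one variable after another) is an assumption.

```r
# Synthetic stand-in for the wine data (an assumption; the real data are not loaded here).
set.seed(42)
wine <- data.frame(
  fixed.acidity        = rlnorm(500, meanlog = 2.1, sdlog = 0.2),
  residual.sugar       = rlnorm(500, meanlog = 0.8, sdlog = 0.4),
  free.sulfur.dioxide  = rlnorm(500, meanlog = 2.6, sdlog = 0.6),
  total.sulfur.dioxide = rlnorm(500, meanlog = 3.6, sdlog = 0.7)
)

# Hypothetical helper: keep rows that are at or below the 99th percentile
# of every listed column. All cutoffs are computed on the untrimmed data.
trim_top_1pct <- function(data, cols) {
  keep <- rep(TRUE, nrow(data))
  for (col in cols) {
    cutoff <- quantile(data[[col]], probs = 0.99, names = FALSE)
    keep <- keep & data[[col]] <= cutoff
  }
  data[keep, , drop = FALSE]
}

trimmed <- trim_top_1pct(wine, names(wine))
nrow(wine)     # rows before trimming
nrow(trimmed)  # rows after trimming
```

Because a row is dropped if it exceeds the cutoff in any one of the four variables, the number of rows removed can be up to four times 1% of the data, which is in line with the drop from 1599 to 1534 observations reported here.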
The long-tailed total sulfur dioxide and sulphates data should be transformed so that their distributions are closer to normal. A log10 transformation can produce a relatively normal distribution for both. Let’s see how the log10 transformation works for each variable:
As can be observed from the above graph, the log10 transformation works well for the sulphates variable.
A similar result is obtained for total sulfur dioxide; comparing the following graphs, the log10 transformation seems to be useful.
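Besides inspecting histograms, the effect of a log10 transformation can be checked numerically by comparing skewness before and after. The sketch below uses synthetic right-skewed values in place of the real sulphates column, and `skewness` is a hypothetical helper defined here (base R has no built-in skewness function).

```r
set.seed(1)
# Synthetic right-skewed values standing in for sulphates (an assumption).
sulphates <- rlnorm(1000, meanlog = -0.45, sdlog = 0.4)

# Hypothetical helper: sample skewness (third standardized moment).
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

raw_skew <- skewness(sulphates)
log_skew <- skewness(log10(sulphates))

# A value near 0 indicates a roughly symmetric distribution.
c(raw = raw_skew, log10 = log_skew)
```

A skewness that moves toward zero after the transformation supports using the log10 scale for that variable.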
Fixed acidity and volatile acidity appear to be long-tailed as well, so a log10 transformation should be a good option for them too. The following graphs support this:
Wine quality is recorded as an integer score, which can be treated as an ordered categorical variable. We can create a new variable called Grade to group Quality into three distinct categories: bad, average, and excellent.
Bad (Quality < 5)
Average (Quality = 5 or 6)
Excellent (Quality > 6)
Here is the count of observations in each of these three groups:
## bad average excellent
## 62 1264 208
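The grouping rule above (Quality below 5, 5 or 6, above 6) maps naturally onto base R's `cut()`. The sketch below applies it to a handful of illustrative scores rather than the full dataset; the variable names are placeholders.

```r
# A few example quality scores (illustrative values, not the full dataset).
quality <- c(3L, 4L, 5L, 5L, 6L, 6L, 7L, 8L)

# Intervals are right-closed by default:
# (-Inf, 4] -> bad, (4, 6] -> average, (6, Inf] -> excellent.
grade <- cut(
  quality,
  breaks = c(-Inf, 4, 6, Inf),
  labels = c("bad", "average", "excellent")
)

table(grade)
```

Using `cut()` keeps the three levels in their natural order, so tables and boxplots list bad, average, and excellent from left to right.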
There are 1534 observations left after removing the top 1% of values for the variables that had large outliers: fixed acidity, residual sugar, total sulfur dioxide, and free sulfur dioxide.
Quality is the main feature.
From the graphs I have seen so far, I believe residual sugar, pH, density and alcohol content have key roles in quality and may end up being selected for the final model.
I created a Grade variable, which groups quality into three distinct categories: bad (3, 4), average (5, 6), and excellent (7, 8).
To get a better look at the correlations between pairs of variables, ggpairs was used.
This scatterplot matrix is especially useful for selecting variables for the final model. Recall that the main purpose of this analysis is to understand how chemical properties affect wine quality (the response, or dependent, variable). We want to select the independent variables that have the highest correlation with Quality, so that the final model is stronger at predicting Quality. Furthermore, we should avoid selecting independent variables that are highly correlated with each other, as this can cause multicollinearity, which leads to inaccurate estimates of the model parameters (coefficients). For instance, including both free.sulfur.dioxide and total.sulfur.dioxide in the model is not advisable, as the correlation between the two variables is high (> 0.60).
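The multicollinearity screen described above can also be done programmatically by listing every pair of predictors whose absolute correlation exceeds a threshold. This is a sketch on synthetic data: the three columns are simulated, `high_corr_pairs` is a hypothetical helper, and the 0.6 threshold simply mirrors the value quoted in the text.

```r
set.seed(7)
n <- 300
# Synthetic predictors (an assumption): total SO2 is built from free SO2
# so that the two are strongly correlated, as in the wine data.
free.sulfur.dioxide  <- rnorm(n, mean = 15, sd = 8)
total.sulfur.dioxide <- 2 * free.sulfur.dioxide + rnorm(n, mean = 15, sd = 10)
alcohol              <- rnorm(n, mean = 10.4, sd = 1)
predictors <- data.frame(free.sulfur.dioxide, total.sulfur.dioxide, alcohol)

corr <- cor(predictors)  # Pearson correlation matrix

# Hypothetical helper: list each predictor pair whose |r| exceeds a threshold.
# upper.tri() restricts the search to one triangle so each pair appears once.
high_corr_pairs <- function(corr, threshold = 0.6) {
  idx <- which(abs(corr) > threshold & upper.tri(corr), arr.ind = TRUE)
  data.frame(
    var1 = rownames(corr)[idx[, "row"]],
    var2 = colnames(corr)[idx[, "col"]],
    r    = corr[idx],
    stringsAsFactors = FALSE
  )
}

high_corr_pairs(corr)
```

Any pair returned by such a screen is a candidate for keeping only one of the two variables in the model.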
In this section, I investigate the correlations between some of the independent variables. Based on the scatterplot matrix shown above, we notice some interesting relationships between the following variables: citric acid and pH (~ -0.53), and citric acid and volatile acidity (~ -0.56). However, none of these variables seems to be strongly correlated with alcohol. Meanwhile, alcohol and quality have a correlation coefficient of 0.48. Hence, alcohol is a good candidate for the final model.
First, I plotted pH against fixed acidity. The correlation coefficient is -0.68, meaning that pH tends to decrease as fixed acidity increases, which makes sense chemically.
## [1] -0.6794406
The correlation between citric acid and pH is weaker, at -0.53. This makes sense, as citric acid is only one component of fixed acidity.
## [1] -0.5283267
Volatile acidity has a weak positive correlation with pH (0.24).
## [1] 0.2387919
## [1] -0.5629224
As can be seen in the graph, there is clearly a negative correlation between volatile acidity and citric acid. Chemically speaking, as volatile acidity is essentially acetic acid, a wine would be unlikely to contain a large amount of both.
## [1] 0.2166557
There is not much relationship between alcohol and pH.
To further explore alcohol, pH, volatile acidity, citric acid, and sulphates and see how they relate to the quality of the wine, boxplots are used, and the median is used as a measure of central tendency because it is less sensitive to outliers than the mean.
## df$Grade: bad
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.303 3.380 3.385 3.500 3.900
## --------------------------------------------------------
## df$Grade: average
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.870 3.210 3.310 3.315 3.402 4.010
## --------------------------------------------------------
## df$Grade: excellent
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.880 3.200 3.280 3.295 3.380 3.780
## df$Grade: bad
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.60 10.00 10.20 10.97 13.10
## --------------------------------------------------------
## df$Grade: average
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.50 9.50 10.00 10.26 10.90 14.00
## --------------------------------------------------------
## df$Grade: excellent
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.50 10.80 11.60 11.54 12.22 14.00
## df$Grade: bad
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2300 0.5800 0.6800 0.7306 0.8838 1.5800
## --------------------------------------------------------
## df$Grade: average
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1600 0.4100 0.5400 0.5386 0.6400 1.3300
## --------------------------------------------------------
## df$Grade: excellent
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3100 0.3700 0.4090 0.4925 0.9150
## df$Grade: bad
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0200 0.0750 0.1713 0.2675 1.0000
## --------------------------------------------------------
## df$Grade: average
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0900 0.2400 0.2538 0.4000 0.7600
## --------------------------------------------------------
## df$Grade: excellent
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.3000 0.3950 0.3687 0.4900 0.7600
## df$Grade: bad
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.4925 0.5600 0.5927 0.6000 2.0000
## --------------------------------------------------------
## df$Grade: average
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3700 0.5400 0.6100 0.6457 0.7000 1.9800
## --------------------------------------------------------
## df$Grade: excellent
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3900 0.6500 0.7400 0.7444 0.8200 1.3600
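Per-grade summaries of this shape are what base R's `by()` prints when given a numeric column, a grouping factor, and `summary`. The sketch below shows the pattern on a tiny made-up data frame; `df_example` and its values are illustrative and are not taken from the wine data.

```r
# Tiny illustrative data frame (values are made up for the example).
df_example <- data.frame(
  alcohol = c(9.4, 10.0, 9.8, 10.5, 11.6, 12.2),
  Grade   = factor(
    c("bad", "bad", "average", "average", "excellent", "excellent"),
    levels = c("bad", "average", "excellent")
  )
)

# Min, quartiles, median, mean, and max of alcohol within each grade.
by(df_example$alcohol, df_example$Grade, summary)

# Medians only, which are less sensitive to outliers than means.
grade_medians <- tapply(df_example$alcohol, df_example$Grade, median)
grade_medians
```

Comparing the medians across grades gives the same ordering that the boxplots show visually.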
The boxplots reveal a very interesting fact about alcohol: alcohol content is noticeably higher for excellent wines than for bad or average wines. Sulphates and citric acid also seem to be positively correlated with quality, whereas volatile acidity appears to be negatively correlated with quality.
Citric acid and sulphates appear to be positively related. Volatile acidity and citric acid are negatively correlated, as are citric acid and pH, and fixed acidity and pH.
Among the pairs examined, the strongest relationship was between fixed acidity and pH (correlation coefficient of -0.679), followed by citric acid and volatile acidity (-0.563).
When comparing sulphates to alcohol, it was noticed that for average wines, quality increased typically as sulphates increased. Furthermore, for excellent wines, it appeared that alcohol played a more important role in determining quality given a specific sulphate level.
In this section, I created scatterplots for a few variables of interest, faceted by quality Grade (bad, average, and excellent), to look for relationships and additional insights. It is worth noting that I used a sequential color scale as well as a regression line for each category, which helps considerably in showing the separation between groups.
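The per-category regression lines can also be examined numerically by fitting one simple regression within each grade and comparing intercepts and slopes. The sketch below does this in base R on synthetic data; the data frame `wines`, its column names, and the simulated relationships are all assumptions made for illustration.

```r
set.seed(99)
# Synthetic data (an assumption): 60 wines per grade with made-up relationships.
grades <- factor(rep(c("bad", "average", "excellent"), each = 60),
                 levels = c("bad", "average", "excellent"))
log10_sulphates <- rnorm(180, mean = -0.2, sd = 0.1)
alcohol <- 10 + 0.8 * as.integer(grades) + 2 * log10_sulphates +
  rnorm(180, sd = 0.5)
wines <- data.frame(grades, log10_sulphates, alcohol)

# Fit a separate simple regression within each grade.
per_grade_fits <- lapply(split(wines, wines$grades), function(d) {
  coef(lm(alcohol ~ log10_sulphates, data = d))
})

# One row per grade: intercept and slope of that grade's line.
grade_lines <- do.call(rbind, per_grade_fits)
grade_lines
```

Clearly different intercepts or slopes between grades correspond to the visual separation of the fitted lines in a faceted plot.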
As can be seen in the above chart, a log10 sulphates value in the range (-0.1, 0) combined with an alcohol level of around 12 leads to the best quality score of 8, which is also an excellent grade.
We know that citric acid affects quality as well. At a given level of citric acid, higher alcohol content typically meant better wines, with the exception of bad wines. It’s likely that the bad wines are affected by a different factor, which masks the benefit of the higher alcohol content.
I’m interested to learn which variable(s) are responsible for bad wine. From all the observations so far, I decided to pick chlorides, residual sugar, and volatile acidity to find out whether they may cause bad wines. Since low citric acid values were found in bad, average, and excellent wines alike, citric acid is used as the test subject to make further inferences.
It can be seen that, for a given level of chlorides, there are many average wines and some excellent wines that also have the same citric acid value. Additionally, most wines have similar levels of chlorides. Hence, chlorides can be off the table.
As can be seen below, residual sugar content is not the variable causing bad wines either.
The above graph, however, illustrates that most bad wines seem to have higher levels of volatile acidity, and most excellent wines have lower levels of volatile acidity.
For the upper right cluster under bad wines, it can be seen that the higher alcohol content of the wines is being masked by the high volatile acidity (0.8 or higher).
Comparing volatile acidity with sulphates, it can be concluded that excellent wines have a lower volatile acidity and a higher sulphates content, whereas bad wines have a lower sulphates content and higher volatile acidity content.
Based on the graphs and analysis performed, I used the four major variables (alcohol, sulphates, citric acid, and volatile acidity) to build a linear model with Quality as the response variable. The table and graph below display the results:
##
## Calls:
## m1: lm(formula = as.numeric(quality) ~ alcohol, data = df)
## m2: lm(formula = as.numeric(quality) ~ alcohol + sulphates, data = df)
## m3: lm(formula = as.numeric(quality) ~ alcohol + sulphates + citric.acid,
## data = df)
## m4: lm(formula = as.numeric(quality) ~ alcohol + sulphates + citric.acid +
## volatile.acidity, data = df)
##
## ============================================================================
## m1 m2 m3 m4
## ----------------------------------------------------------------------------
## (Intercept) 1.734*** 1.243*** 1.296*** 2.492***
## (0.179) (0.181) (0.180) (0.207)
## alcohol 0.374*** 0.358*** 0.351*** 0.322***
## (0.017) (0.017) (0.017) (0.016)
## sulphates 1.000*** 0.824*** 0.711***
## (0.104) (0.108) (0.105)
## citric.acid 0.511*** -0.087
## (0.096) (0.108)
## volatile.acidity -1.242***
## (0.117)
## ----------------------------------------------------------------------------
## R-squared 0.239 0.282 0.295 0.344
## adj. R-squared 0.238 0.281 0.294 0.342
## sigma 0.707 0.687 0.681 0.657
## F 479.886 300.740 213.633 200.172
## p 0.000 0.000 0.000 0.000
## Log-likelihood -1644.831 -1599.682 -1585.485 -1530.839
## Deviance 766.824 722.987 709.728 660.922
## AIC 3295.663 3207.363 3180.970 3073.678
## BIC 3311.670 3228.706 3207.648 3105.692
## N 1534 1534 1534 1534
## ============================================================================
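The stepwise construction behind a table like the one above (add one predictor at a time, then compare fit statistics) can be sketched with base R's `lm()` and `update()`. The data below are simulated with arbitrary coefficients, and the side-by-side layout is assembled by hand, so the numbers will not match the table.

```r
set.seed(123)
n <- 400
# Synthetic predictors and response (assumptions; coefficients are arbitrary).
sim <- data.frame(
  alcohol          = rnorm(n, mean = 10.4, sd = 1.0),
  sulphates        = rnorm(n, mean = 0.65, sd = 0.15),
  volatile.acidity = rnorm(n, mean = 0.53, sd = 0.18)
)
sim$quality <- 2.5 + 0.3 * sim$alcohol + 0.7 * sim$sulphates -
  1.2 * sim$volatile.acidity + rnorm(n, sd = 0.65)

# Add one predictor at a time, in the spirit of models m1 to m4.
s1 <- lm(quality ~ alcohol, data = sim)
s2 <- update(s1, . ~ . + sulphates)
s3 <- update(s2, . ~ . + volatile.acidity)

# Collect R-squared, adjusted R-squared, and AIC for each model.
fits <- list(s1 = s1, s2 = s2, s3 = s3)
comparison <- data.frame(
  r.squared     = sapply(fits, function(m) summary(m)$r.squared),
  adj.r.squared = sapply(fits, function(m) summary(m)$adj.r.squared),
  AIC           = sapply(fits, AIC)
)
comparison
```

R-squared can only stay the same or rise as predictors are added, so adjusted R-squared and AIC are the more informative columns when deciding whether a new variable earns its place. In the table above, for example, the citric.acid coefficient is no longer significant once volatile.acidity enters in m4, which is consistent with the correlation of about -0.56 between those two predictors.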
As described in the above analysis, four features (alcohol, sulphates, citric acid, and volatile acidity) are the most important ones for explaining the feature of interest, quality. A summary of the relationships between the features follows:
Citric acid and alcohol: There is a clear relationship between alcohol content and citric acid with respect to the quality of wine. For instance, lower quality wines tended to be lower in alcohol content and citric acid. Higher alcohol content was associated with better average wines regardless of the citric acid content. Additionally, excellent wines tended to be higher in both alcohol content and citric acid.
For average wines, the plot of sulphates versus citric acid showed that sulphates were mainly larger. However, for excellent wines, a higher citric acid content resulted in an excellent wine at a given level of sulphates. One may conclude that citric acid is more important than sulphates with regard to what makes a wine excellent. However, a log10 sulphates value between -0.25 and 0 appeared to be necessary for a wine to reach that grade. This strengthens the idea that low sulphate content played a key role in average or bad wines.
The relationship between alcohol and volatile acidity was an interesting one, as low volatile acidity appeared to be a requirement for a wine to be excellent. There are many average wines with volatile acidity between 0.4 and 0.8 and alcohol content between 9% and 10%, whereas most excellent wines have volatile acidity between 0.1 and 0.4. Bad or average wines generally had volatile acidity above 0.4, whatever their alcohol content.
High volatile acidity and low sulphates were a strong indicator of bad wine. Higher alcohol content, lower volatile acidity, higher citric acid, and higher sulphates together resulted in a good wine.
As explained above, a linear model was built using the four variables that appeared to be the most important in describing the feature of interest, wine quality. These four variables are alcohol, citric acid, sulphates, and volatile acidity. The model is far from the best possible one, as I used a linear model for simplicity; more advanced analysis may be needed to obtain a better model.
This graph illustrates that a higher alcohol content is generally needed for excellent wines. The jump from an average wine to an excellent wine typically requires an alcohol level of about 12% or more. Other factors should not be ignored, however, as we will see in plots 2 and 3.
This graph clearly shows what was stated at the end of the description of plot one: a higher level of alcohol is necessary for a good quality wine, but it is not sufficient. As can be seen in the above chart, at a volatile acidity level greater than 0.8, an increase in the alcohol level does not lift the wine out of the bad grade. Furthermore, a volatile.acidity level between 0.4 and 0.8 typically results in an average wine, and for a volatile.acidity level below 0.4, the jump in quality from average to excellent as alcohol increases (similar to the one seen in plot one) can be observed again.
From this graph, it can be seen that lower sulphates content typically leads to a bad wine when alcohol varies between 10% and 12%. Furthermore, average wines have higher sulphates in general. Nevertheless, alcohol content still plays a role and needs to be higher as well for higher sulphates to result in an average wine. Lastly, excellent wines are mostly clustered around higher alcohol contents (11-12%) as well as higher sulphate contents (log10 sulphates in the range (-0.1, 0)).
When I learned about this dataset, it immediately interested me, as I like wine in general, and it was very interesting to learn in such detail about the ingredients and which of them mainly affect the quality of wine. Overall, I believe this was a successful data analysis experience, since I was able to thoroughly explore different features and compare them with respect to the feature of interest (the wine quality variable), build a simple model from the most important features, and obtain a fairly clear understanding of the factors that make a quality wine. As for struggles, although there were not too many variables, it was not an easy task for me to find the most important and insightful selection of variables to plot or analyze, and I needed several trials to arrive at an insightful analysis. I believe that is the main part of the fun though! The next steps for me would be to try more advanced models on the data, such as KNN or decision tree models.
Paulo Cortez, António Cerdeira, Fernando Almeida, Telmo Matos, José Reis, Modeling wine preferences by data mining from physicochemical properties, Decision Support Systems, Volume 47, Issue 4, 2009, Pages 547-553, ISSN 0167-9236, https://doi.org/10.1016/j.dss.2009.05.016.
(http://www.sciencedirect.com/science/article/pii/S0167923609001377)
Dataset link: http://www3.dsi.uminho.pt/pcortez/dss09.bib
http://datadrivenjournalism.net/resources/when_should_i_use_logarithmic_scales_in_my_charts_and_graphs
https://www.r-bloggers.com/multiple-regression-lines-in-ggpairs/